Recent Advances in Vision Foundation Models


In conjunction with CVPR 2025

June 12th, 2025 (1 p.m. to 5 p.m. CDT)

Location: 401 AB, Music City Center, Nashville TN

CVPR 2025 Tutorial on "Recent Advances in Vision Foundation Models"

This tutorial presents recent advances in vision foundation models, a topic that has drawn significant attention from the computer vision community. It covers the most advanced directions in designing and developing vision foundation models, including state-of-the-art approaches and principles in (i) learning vision foundation models for multimodal understanding and generation, (ii) scaling test-time compute and enabling foundation models to self-train and improve their own reasoning and perception, and (iii) physical and virtual agents built on vision foundation models that can take actions in robotics and in virtual environments.

Program (CDT)

You are welcome to join our tutorial either in person or virtually via Zoom (log in to the CVPR 2025 portal to find the Zoom link).

Afternoon Session
13:00 - 13:50   Advancing Multimodal LLMs: From Seeing to Understanding and Acting   [Slides]   Zhe Gan
13:50 - 14:40   Multimodal Reasoning for Visual-Centric Long-Horizon Tasks   [Slides]   Zhengyuan Yang
14:40 - 15:00   Coffee Break & Q&A
15:00 - 15:50   See. Think. Act. Training Multimodal Agents with Reinforcement Learning   [Slides]   Linjie Li
15:50 - 16:40   Towards Multimodal AI Agent That Can See, Think and Act   [Slides]   Jianwei Yang
16:40 - 17:00   Closing Remarks & Q&A

Organizers

Zhe Gan

Apple

Linjie Li

Microsoft

Zhengyuan Yang

Microsoft

Lijuan Wang

Microsoft

Contacts

Contact the Organizing Committee: vlp-tutorial@googlegroups.com